Bresenham's Line
Algorithm Implemented for
the NS32GX32



1.0 INTRODUCTION

Even with today's achievements in graphics technology, the 
resolution of computer graphics systems will never reach 
that of the real world. A true real line can never be drawn on 
a laser printer or CRT screen. There is no method of accu- 
rately printing all of the points on the continuous line de- 
scribed by the equation y = mx + b. Similarly, circles, ellip- 
ses and other geometrical shapes cannot truly be imple- 
mented by their theoretical definitions because the graphics 
system itself is discrete, not real or continuous. For that 
reason, there has been a tremendous amount of research 
and development in the area of discrete or raster mathematics.
Many algorithms have been developed which "map"
real-world images into the discrete space of a raster device.
Bresenham's line-drawing algorithm (and its derivatives)
is one of the most commonly used algorithms today
for describing a line on a raster device. The algorithm was
first published in Bresenham's 1965 article entitled
"Algorithm for Computer Control of a Digital Plotter". It is now
widely used in graphics and electronic printing systems.
This application note describes the fundamental algorithm 
and shows an implementation specially tuned for the 
NS32GX32 microprocessor. Although given in the context 
of this specific application note, the assembly level opti- 
mizations are relevant to general programming for the 
NS32GX32. Timing figures are given in Appendix C. 

2.0 DESCRIPTION

Bresenham's line-drawing algorithm uses an iterative 
scheme. A pixel is plotted at the starting coordinate of the 
line, and each iteration of the algorithm increments the pixel 
one unit along the major, or x-axis. The pixel is incremented 
along the minor, or y-axis, only when a decision variable 
(based on the slope of the line) changes sign. A key feature 
of the algorithm is that it requires only integer data and sim- 
ple arithmetic. This makes the algorithm very efficient and 
fast. 
The algorithm assumes the line has positive slope less than
one, but a simple change of variables can modify the
algorithm for any slope value. This will be detailed in Section 2.2.

2.1 Bresenham's Algorithm for 0 < slope < 1

Figure 1 shows a line segment superimposed on a raster
grid with horizontal axis X and vertical axis Y. Note that x_i
and y_i are the integer abscissa and ordinate respectively of
each pixel location on the grid.

Given (x_i, y_i) as the previously plotted pixel location for the
line segment, the next pixel to be plotted is either (x_i + 1, y_i)
or (x_i + 1, y_i + 1). Bresenham's algorithm determines
which of these two pixel locations is nearer to the actual line
by calculating the distance from each pixel to the line, and
plotting that pixel with the smaller distance. Using the familiar
equation of a straight line, y = mx + b, the y value
corresponding to x_i + 1 is

    y = m(x_i + 1) + b

The two distances are then calculated as:

    d1 = y - y_i
    d1 = m(x_i + 1) + b - y_i

    d2 = (y_i + 1) - y
    d2 = (y_i + 1) - m(x_i + 1) - b

and,

    d1 - d2 = m(x_i + 1) + b - y_i - (y_i + 1) + m(x_i + 1) + b
    d1 - d2 = 2m(x_i + 1) - 2y_i + 2b - 1

Multiplying this result by the constant dx, defined by the
slope of the line m = dy/dx, the equation becomes:

    dx(d1 - d2) = 2dy(x_i) - 2dx(y_i) + c

where c is the constant 2dy + 2dx(b) - dx. Of course, if
d2 > d1, then (d1 - d2) < 0, or conversely if d1 > d2, then
(d1 - d2) > 0. Therefore, a parameter p_i can be defined
such that

    p_i = dx(d1 - d2)
    p_i = 2dy(x_i) - 2dx(y_i) + c
FIGURE 2. Distances d1 and d2 are compared.
The smaller distance marks the next pixel to be plotted.
If p_i > 0, then d1 > d2 and y_{i+1} = y_i + 1 is chosen, such that the
next plotted pixel is (x_i + 1, y_i + 1). Otherwise, if p_i < 0, then
d2 > d1 and (x_i + 1, y_i) is plotted. (See Figure 2.)
Similarly, for the next iteration, p_{i+1} can be calculated and
compared with zero to determine the next pixel to plot. If
p_{i+1} < 0, then the next plotted pixel is at (x_{i+1} + 1, y_{i+1});
if p_{i+1} > 0, then the next point is (x_{i+1} + 1, y_{i+1} + 1).
Note that in the equation for p_{i+1}, x_{i+1} = x_i + 1.

    p_{i+1} = 2dy(x_i + 1) - 2dx(y_{i+1}) + c

Subtracting p_i from p_{i+1}, we get the recursive equation:

    p_{i+1} = p_i + 2dy - 2dx(y_{i+1} - y_i)

Note that the constant c has conveniently dropped out of
the formula. And, if p_i < 0 then y_{i+1} = y_i in the above
equation, so that:

    p_{i+1} = p_i + 2dy

or, if p_i >= 0 then y_{i+1} = y_i + 1, and

    p_{i+1} = p_i + 2(dy - dx)
To further simplify the iterative algorithm, constants c1 and 
c2 can be initialized at the beginning of the program such 
that c1 = 2dy and c2 = 2(dy - dx). Thus, the actual meat 
of the algorithm is a loop of length dx, containing only a few 
integer additions and two compares (Figure 3). 
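
The recurrence above can be sketched in C as follows. This is a minimal illustration, not the routine of Appendix A: the function name bres_first_octant and its output arrays are hypothetical, and the line is assumed to satisfy 0 <= dy <= dx. The starting value p = 2dy - dx is the one used in Appendix A; it follows from substituting b = y_0 - m(x_0) into the expression for p_i at the first pixel (x_0, y_0).

```c
/* Hypothetical helper (not from the application note): records the
 * pixels of the line from (x0, y0) to (x1, y1), assuming
 * 0 <= dy <= dx.  Returns the number of points written. */
int bres_first_octant(int x0, int y0, int x1, int y1,
                      int xs[], int ys[], int max)
{
    int dx = x1 - x0;
    int dy = y1 - y0;
    int c1 = 2 * dy;           /* added when p < 0  (y unchanged)    */
    int c2 = 2 * (dy - dx);    /* added when p >= 0 (y incremented)  */
    int p  = 2 * dy - dx;      /* initial decision variable          */
    int x = x0, y = y0, n = 0;

    if (n < max) {             /* plot the starting pixel */
        xs[n] = x;
        ys[n] = y;
        n++;
    }
    while (x != x1 && n < max) {
        if (p < 0) {
            p += c1;
        } else {
            p += c2;
            y++;               /* step along the minor axis */
        }
        x++;                   /* always step along the major axis */
        xs[n] = x;
        ys[n] = y;
        n++;
    }
    return n;
}
```

For the line from (0, 0) to (5, 2) this visits (0,0), (1,0), (2,1), (3,1), (4,2), (5,2): one pixel per x coordinate, using only additions and a sign test inside the loop.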

2.2 For Slope < 0 and |Slope| > 1

The algorithm as described fails when the slope is negative or
has absolute value greater than one (|dy| > |dx|). The line will
always be plotted with a positive slope if x_i and y_i are always
incremented in the positive direction, and the line will always
be "shorted" if |dx| < |dy|, since the algorithm executes once
for every x coordinate (i.e., dx times). A closer look at the
algorithm reveals, however, that a few simple changes of
variables take care of these special cases.



For negative slopes, the change is simple. Instead of
incrementing the pixel along the positive direction (+1) for each
iteration, the pixel is incremented in the negative direction.
The relationship between the starting point and the finishing 
point of the line determines which axis is followed in the 
negative direction, and which is in the positive. Figure 4 
shows all the possible combinations for slopes and starting 
points, and their respective incremental directions along the 
X and Y axis. 

Another change of variables can be performed on the incre- 
mental values to accommodate those lines with slopes 
greater than 1 or less than - 1 . The coordinate system con- 
taining the line is rotated 90 degrees so that the X-axis now 
becomes the Y-axis and vice versa. The algorithm is then 
performed on the rotated line according to the sign of its 
slope, as explained above. Whenever the current position is 
incremented along the X-axis in the rotated space, it is actu- 
ally incremented along the Y-axis in the original coordinate 
space. Similarly, an increment along the Y-axis in the rotat- 
ed space translates to an increment along the X-axis in the 
original space. Figures 4a, 4g, and 4h illustrate this
translation process for both positive and negative lines with various
starting points.
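
The two changes of variables can be sketched together in C. This is an illustration under stated assumptions rather than the Appendix A routine: the name bres_walk and its out-parameters are hypothetical, and it tracks (x, y) coordinates directly instead of a bit position. The sign of dx and dy selects the increment direction, and when |dy| > |dx| the roles of the major and minor axes are exchanged.

```c
#include <stdlib.h>   /* abs */

/* Hypothetical sketch: walks a line from (xs, ys) to (xf, yf) for any
 * slope.  Stores the last pixel visited in *ex, *ey and returns the
 * number of pixels visited. */
int bres_walk(int xs, int ys, int xf, int yf, int *ex, int *ey)
{
    int dx = xf - xs, dy = yf - ys;
    int adx = abs(dx), ady = abs(dy);
    int sx = (dx > 0) ? 1 : -1;       /* increment direction along x */
    int sy = (dy > 0) ? 1 : -1;       /* increment direction along y */
    int steep = ady > adx;            /* |slope| > 1: rotate the axes */
    int major = steep ? ady : adx;
    int minor = steep ? adx : ady;
    int c1 = 2 * minor;
    int c2 = 2 * (minor - major);
    int p  = 2 * minor - major;
    int x = xs, y = ys, count = 1, i;

    for (i = 0; i < major; i++) {
        if (p < 0) {
            p += c1;
        } else {
            p += c2;
            if (steep) x += sx; else y += sy;   /* minor-axis step */
        }
        if (steep) y += sy; else x += sx;       /* major-axis step */
        count++;
    }
    *ex = x;
    *ey = y;
    return count;
}
```

The loop runs max(|dx|, |dy|) times, so steep lines are no longer cut short, and every combination of signs ends on the requested end point.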



    do while count <> dx
        if (p < 0) then
            p += c1
        else
            p += c2
            next_y = prev_y + y_inc
        endif
        next_x = prev_x + x_inc
        plot(next_x, next_y)
        count += 1

    /* PSEUDO CODE FOR BRESENHAM LOOP */

FIGURE 3




FIGURE 4. Incremental directions (x inc, y inc) along the X and Y axes
for each combination of slope and starting point (p1 or p2), panels a
through h.

Note: a, g, and h are rotated 90 degrees left and x', y' refer to the original axis.
FIGURE 5. Bit map is 500,000 bytes, 2000 x 2000 bits.
Base address of the bit map is "_bit_map".


3.0 IMPLEMENTATION IN C

Bresenham's algorithm is easily implemented in most programming
languages. Appendix A gives a C routine implementing the algorithm.

The routine accepts as parameters the line's start and end
coordinates, (xs, ys) and (xf, yf). It plots the line on a bit-map
allocated in memory. The dimensions of the bit map are
given in the include file "bres.h".

The program uses the variable bit to keep track of the current
pixel position within the 2000 x 2000 bit map (Figure 5).

Note the macro definition for setting a bit in memory:

    #define sbit(buffer, pos) (buffer)[(pos) >> 3] |= 1 << (((char)pos) & 7)

Bit "pos" is set by calculating pos MOD 8, which is the
same as (pos & 7). Then 1 is shifted by this amount to
obtain a mask of the bit to set in byte pos/8 of "buffer",
which is the same as (pos >> 3).

4.0 IMPLEMENTATION IN SERIES 32000® ASSEMBLY,
SPECIALLY TUNED FOR THE NS32GX32

This section demonstrates several kinds of assembly level
optimizations for speeding up NS32GX32 programs. These
take into account the execution times of different instructions,
data dependency between registers, the NS32GX32
on-chip cache, and the NS32GX32 branch prediction method.

4.1 Instruction Execution Time

The NS32GX32 is fully binary compatible with its predecessors
from the Series 32000 family of microprocessors. The
NS32GX32's pipelined architecture and high frequency internal
clock enable programs written for the other members
of the processor family to run faster, in general, on the
NS32GX32.

However, one of the characteristics of the NS32GX32 is
that although the average throughput of its pipeline is 3.5
clock cycles per instruction, there are several instructions
whose execution time is much longer than this. Further
improvement in execution time may be achieved by avoiding
these relatively "expensive" instructions.

We will demonstrate here the replacement of the relatively
expensive instructions "sbit" and "mul".

4.1.1 Replacing the "SBIT" Instruction

The most straightforward coding of the Bresenham algorithm
would use the NS32000 "sbit" instruction for setting
the current bit of the line (see AN-524: "Introduction to
Bresenham's Line Algorithm using the SBIT Instruction").
The "sbit" instruction, however, is relatively expensive.
Since the bit setting is a significant part of the routine's
main loop, it is important to optimize it.

In our assembly routine, we replace the "sbitd r1,_bit_map"
instruction with the following sequence, which saves an
average of 3.5 cycles per bit set, or about 8% of the main
loop time. This sequence is essentially an implementation of
the calculation defined in the "sbit" macro used in the C
language routine.

    movd    r1,r4
    andd    $(7),r4             # r4 = bit mod 8 = which bit to set
    movqb   $(1),r6
    lshb    r4,r6               # r6 = 1 << (bit mod 8) = mask with set bit
    movd    r1,r4
    lshd    $(-3),r4            # r4 = bit div 8 = byte where bit is set
    orb     r6,_bit_map(r4)



Note the use of the "lsh" instruction, rather than the "ash"
one. The former requires 3 cycles for execution, compared
to 9 for the latter. Both instructions are equivalent in our
case, as r1 (the bit to be set) is an unsigned quantity less
than 4,000,000.
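
The same byte-and-mask calculation can be written as a small C function. This is only a sketch for illustration; the function name set_bit is hypothetical (the note itself uses the "sbit" macro), and the comments map each step onto the instructions of the sequence above.

```c
/* Hypothetical function form of the bit-setting calculation:
 * byte index = pos div 8, mask = 1 << (pos mod 8). */
void set_bit(unsigned char buffer[], unsigned pos)
{
    unsigned byte_index = pos >> 3;                          /* lshd $(-3)        */
    unsigned char mask = (unsigned char)(1u << (pos & 7u));  /* andd, movqb, lshb */

    buffer[byte_index] |= mask;                              /* orb               */
}
```

Because pos is unsigned, the right shift is a logical shift, which is the C counterpart of choosing "lsh" over "ash".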

4.1.2 Replacing the "MUL" Instruction 

The setup calculations done before the main loop of the 
algorithm include two multiplications by the x-dimension of 
the bit-map. 

The "mul" instruction takes 37 cycles to multiply by num- 
bers in the order of magnitude of a reasonable bit-map di- 
mension. For any specific application it is usually possible to 
replace the "mul" by one or a few faster instructions. If the 
dimension is a power of two, one "lsh" instruction, executing
in 3 cycles, is enough.

In other cases the dimension can be factored into a sum/ 
difference of powers of two. The multiplication is then re- 
placed by a series of shifts and adds. An example of multi- 
plication by 2000 is demonstrated in Appendix D. 
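
As an illustration of the idea, 2000 can be written as 2048 - 32 - 16, so a multiplication by 2000 reduces to three shifts and two subtractions. The C sketch below uses this particular factoring as an assumption; it is not necessarily the decomposition used in Appendix D, and the function name mul2000 is hypothetical.

```c
/* Hypothetical helper: computes x * 2000 without a multiply,
 * using 2000 = 2^11 - 2^5 - 2^4 = 2048 - 32 - 16. */
unsigned mul2000(unsigned x)
{
    return (x << 11) - (x << 5) - (x << 4);
}
```

For coordinates inside a 2000 x 2000 bit map the result stays below 4,000,000, so no overflow can occur in a 32-bit register.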

4.2 Avoiding Register Interlocks 

In certain circumstances the flow of instructions in the 
NS32GX32 pipeline will be delayed when the result of an 
instruction is used as the source of the following instruction. 
One of these interlocks occurs in Section 4.1.1: the
instruction "orb r6,_bit_map(r4)" immediately follows the
calculation of r4 in "lshd $(-3),r4". To avoid this
interlock, we can exchange the order of calculating the byte and
the bit, as follows:

    movd    r1,r0
    lshd    $(-3),r0            # r0 = bit div 8 = byte where bit is set
    movd    r1,r4
    andd    $(7),r4             # r4 = bit mod 8 = which bit to set
    movqb   $(1),r6
    lshb    r4,r6               # r6 = 1 << (bit mod 8) = mask with set bit
    orb     r6,_bit_map(r0)

Moving the calculation of the byte away from its use in the 
"orb" instruction saves about 2.5 cycles per bit, or another 
6% of the main loop time. Added to the optimization of 
4.1.1, this gives a potential improvement of about 14% com- 
pared to the straightforward "sbit" instruction. This, howev- 
er, is not the actual improvement we get: The new se- 
quence requires keeping extra registers free in the loop, 
forcing us to use memory for some of the algorithm vari- 
ables (see Section 4.3 below). This has an overhead of 
about 2% of the main loop time, giving a net improvement 
of about 12%. 

4.3 Data Cache Considerations 

The main loop of the Bresenham algorithm uses seven vari- 
ables: "c1" and "c2" the loop-constants, "p" the decision 
variable, "x-increment" and "y-increment", "bit" and "last- 
bit". If the "sbit" instruction is used, all of these can reside 
in registers, as the NS32GX32 has eight general-purpose 
registers. 



As mentioned above, the replacement of "sbitd r1,_bit_map"
with the above sequence has the cost of requiring three
intermediate temporary values. These values, with the
addition of the algorithm's seven loop-used variables, give
us the problem of deciding what should be put in registers
and what in memory. This decision must be strongly
influenced by the NS32GX32 data cache, and its
write-through update policy.

The most important consideration is that, due to the
write-through policy, a write to memory is more expensive than a
write to a register. Thus, the loop-invariant variables ("c1",
"c2", "last-bit", "x-increment" and "y-increment") are better
candidates for allocation in memory rather than in a register.
The additional temporary values for the "sbit" alternative
sequence are written, so registers should be allocated
to them. Reading one of the variables from a memory location
may also result in a data cache miss, because the
"_bit_map" references may overwrite them. Thus the final
choice for memory instead of registers was for "c1" and
"y-increment" (32(sp) and 24(sp) in Figure 6). "last-bit" was
not chosen because it is read in every iteration of the loop.
"c1" and "y-increment" are never both read in any of the
branches inside the loop, so for each bit set we always have
only one read from memory.

4.4 Optimizing Branch Instructions 

An additional improvement of 2.4% is achieved in Appendix 
B's assembly routine by optimizing the flow of branches in 
the main loop. This is a relatively complicated issue, so its 
details are given in Appendix D. 

4.5 Loop Unrolling 

Another method to speed up the main loop of the algorithm 
is to reduce the overhead of branches in this loop. The idea 
is to replicate the code in the loop, so that in each iteration 
two bits will be set, without any conditional branch between 
them. To ensure that there is an even number of bits to plot 
when entering the loop, we must add another test after set- 
ting the first bit before the loop, and perhaps plot another 
bit. This slightly lengthens the pre-loop execution time, but 
is worth doing for the reduction in the loop time. 
The general outline of the routine, after the initial calcula- 
tions, becomes: 

1. set first bit

2. if an odd number of bits remains, plot an extra bit. 

3. LOOP: 

a. plot bit 

b. plot bit 

c. if not last bit, goto LOOP 

Without the code replication, there is a test to check if it is 
the last bit for each bit plotted. The replication saves one 
such test and the delay associated with its conditional 
branch. 
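
The outline can be sketched in C for the simple case 0 <= dy <= dx. This is an illustration only, not the code of Figure 6 or Appendix B: the names step, SET_BIT and line_unrolled are hypothetical, and the width parameter stands in for the bit-map dimension.

```c
/* Hypothetical bit-setting macro, equivalent in effect to "sbit". */
#define SET_BIT(buf, pos) \
    ((buf)[(pos) >> 3] |= (unsigned char)(1u << ((pos) & 7)))

/* Advance the decision variable and the bit position by one pixel. */
static void step(int *p, int *bit, int c1, int c2, int x_inc, int y_inc)
{
    if (*p < 0) {
        *p += c1;
    } else {
        *p += c2;
        *bit += y_inc;
    }
    *bit += x_inc;
}

/* Hypothetical sketch of the unrolled routine, assuming 0 <= dy <= dx.
 * Returns the number of bits set. */
int line_unrolled(unsigned char map[], int width,
                  int xs, int ys, int xf, int yf)
{
    int dx = xf - xs, dy = yf - ys;
    int c1 = 2 * dy, c2 = 2 * (dy - dx), p = 2 * dy - dx;
    int bit = ys * width + xs;
    int last_bit = yf * width + xf;
    int plotted = 1;

    SET_BIT(map, bit);                    /* 1. set first bit */
    if (bit == last_bit)
        return plotted;

    if (dx & 1) {                         /* 2. odd number of bits remains */
        step(&p, &bit, c1, c2, 1, width);
        SET_BIT(map, bit);
        plotted++;
        if (bit == last_bit)
            return plotted;
    }
    do {                                  /* 3. LOOP: two bits per pass */
        step(&p, &bit, c1, c2, 1, width);
        SET_BIT(map, bit);                /* a. plot bit */
        step(&p, &bit, c1, c2, 1, width);
        SET_BIT(map, bit);                /* b. plot bit */
        plotted += 2;
    } while (bit != last_bit);            /* c. if not last bit, goto LOOP */
    return plotted;
}
```

The end-of-line test is executed once per two bits inside the loop. The parity test on dx before the loop guarantees that the last bit always falls on the second plot of an iteration, so the loop cannot step past it.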

The actual code (reverting to the simple "sbit" code) is giv- 
en in Figure 6. 
In Appendix B, an additional code replication is done, to
save the "br next_bit" instruction in the code above. The

total improvement from the code replication is an additional 
5% over the code that sets one bit per iteration. 



The register and memory usage in this example are:

    r1      - current bit to be set
    r3      - last bit to be set
    32(sp)  - c1
    r2      - c2
    r7      - p
    r5      - x-increment
    24(sp)  - y-increment
    36(sp)  - max(dx, dy)



#
#       PLOTTING OF THE LINE
#
set_first_bit:
        sbitd   r1,_bit_map         # plot first point by setting bit in bit_map
        cmpd    r1,r3               # is the first bit also the last bit?
        beq     EXIT_SEQUENCE
#
# Test if no. of bits is odd or even: if the difference (max(dx,dy))
# is odd, there is a total of an even no. of bits, and vice versa.
# The first bit has been plotted, so we must decide if to plot another
# bit before the loop (which plots two bits in each iteration) or not.
#
        tbitd   $(0),36(sp)
        bfc     MAIN_LOOP
        .align 4
#
# EXTRA BIT SET FOR EVEN NO. OF BITS
#
PRE_LOOP:
        cmpqd   $(0),r7             # is p < 0?
        bgt     p_negative
#
# if p >= 0 then
#
        addd    r2,r7               # p = p + c2
        addd    24(sp),r1           # bit = bit + y-increment
        addd    r5,r1               # bit = bit + x-increment
        sbitd   r1,_bit_map         # plot by setting bit in bit_map
        cmpd    r1,r3               # have we just plotted the last bit?
        bne     MAIN_LOOP
        br      EXIT_SEQUENCE
        .align 4
p_negative:
        addd    32(sp),r7           # p = p + c1
        addd    r5,r1               # bit = bit + x-increment
        sbitd   r1,_bit_map         # plot by setting bit in bit_map
        cmpd    r1,r3               # have we just plotted the last bit?
        beq     EXIT_SEQUENCE
#
# ================== MAIN LOOP ==================
#
MAIN_LOOP:
        cmpqd   $(0),r7             # is p < 0?
        bgt     p_negative1
#
# if p >= 0 then
#
        addd    r2,r7               # p = p + c2
        addd    24(sp),r1           # bit = bit + y-increment
        addd    r5,r1               # bit = bit + x-increment
        sbitd   r1,_bit_map         # plot by setting bit in bit_map
        br      next_bit            # no need to test if last bit
        .align 4
p_negative1:
        addd    32(sp),r7           # p = p + c1
        addd    r5,r1               # bit = bit + x-increment
        sbitd   r1,_bit_map         # plot by setting bit in bit_map
#
# unrolled repetition
#
next_bit:
        cmpqd   $(0),r7             # is p < 0?
        bgt     p_negative2
#
# if p >= 0 then
#
        addd    r2,r7               # p = p + c2
        addd    24(sp),r1           # bit = bit + y-increment
        addd    r5,r1               # bit = bit + x-increment
        sbitd   r1,_bit_map         # plot by setting bit in bit_map
        cmpd    r1,r3               # have we just plotted the last bit?
        bne     MAIN_LOOP
        br      EXIT_SEQUENCE
        .align 4
p_negative2:
        addd    32(sp),r7           # p = p + c1
        addd    r5,r1               # bit = bit + x-increment
        sbitd   r1,_bit_map         # plot by setting bit in bit_map
        cmpd    r1,r3               # have we just plotted the last bit?
        bne     MAIN_LOOP
#
# =============== END OF MAIN LOOP ===============
#
EXIT_SEQUENCE:

FIGURE 6





5.0 CONCLUSION

An optimized Bresenham line-drawing algorithm has been
presented for the NS32GX32 microprocessor. The optimizations
used are relevant to general coding for the
NS32GX32. Appendix C gives timing results for the
implementations given in Appendix A and Appendix B.

Several variations of the Bresenham algorithm have been
developed. One particular variation by Bresenham himself
relies on "run-length" segments of the line for speed
optimization. This is explored in Application Note AN-522.

APPENDIX A. IMPLEMENTATION IN C 

/*
 * File "bres.h"
 * global definitions and declarations for the
 * program using Bresenham line drawing algorithm.
 */

#define dimension 2000                  /* must be a multiple of 8 */
#define xbytes    (dimension/8)         /* number of bytes along x-axis */
#define warp      dimension             /* number of bits along x-axis */
#define maxx      ((warp)-1)            /* highest x coordinate */
#define maxy      (dimension-1)         /* highest y coordinate */
#define y_lines   (maxy+1)              /* number of lines along y-axis */

unsigned char bit_map[y_lines*xbytes];  /* array containing bit-map */

/* end of "bres.h" */

/*
 * File "bres_line.c"
 * Implementation of Bresenham's line drawing algorithm
 * in the C programming language.
 */

#include <stdlib.h>                     /* abs() */
#include "bres.h"

/*
 * sbit(buffer, pos)
 * char buffer[];
 * unsigned int pos;
 * bit 'pos' is set by calculating pos MOD 8, which is the same as (pos & 7).
 * Then 1 is shifted by this amount to obtain a mask of the bit to set in
 * byte pos/8 of 'buffer', which is the same as (pos >> 3).
 */
#define sbit(buffer, pos) (buffer)[(pos) >> 3] |= 1 << (((char)pos) & 7)

/*
 * line_draw(xs, ys, xf, yf)
 * int xs, ys, xf, yf;
 * Draws a line on "bit_map", from coordinates (xs, ys) to (xf, yf), using
 * Bresenham's iterative method.
 */
line_draw(xs, ys, xf, yf)
int xs, ys, xf, yf;
{
    int dx, dy, x_inc, y_inc,           /* deltas and increments */
        p, c1, c2;                      /* decision variable p and constants */
    unsigned bit, last_bit;             /* current and last bit positions */

    dx = xf - xs;
    dy = yf - ys;
    bit = ys * warp + xs;               /* initialize bit to first bit pos */
    last_bit = yf * warp + xf;          /* calculate last bit on line */

    if (abs(dy) > abs(dx)) {            /* abs(slope) > 1 must rotate space */
        if (dy > 0)
            x_inc = warp;               /* x-axis is now original y-axis */
        else
            x_inc = -warp;
        if (dx > 0)
            y_inc = 1;                  /* y-axis is now original x-axis */
        else
            y_inc = -1;
        /* Calculate Bresenham's constants: */
        c1 = 2 * abs(dx);
        c2 = 2 * (abs(dx) - abs(dy));
        p  = 2 * abs(dx) - abs(dy);     /* p is decision variable now rotated */
    }
    else {                              /* abs(slope) <= 1 - use original axis */
        if (dy > 0)
            y_inc = warp;               /* y_inc is +/-warp number of bits */
        else
            y_inc = -warp;
        if (dx > 0)
            x_inc = 1;
        else
            x_inc = -1;
        /* Calculate Bresenham's constants: */
        c1 = 2 * abs(dy);
        c2 = 2 * (abs(dy) - abs(dx));
        p  = 2 * abs(dy) - abs(dx);
    }

    /*============ Bresenham's Algorithm ==============*/

    sbit(bit_map, bit);                 /* draw the first point */
    while (bit != last_bit) {           /* once for each increment, */
                                        /* i.e., dx times */
        if (p < 0)                      /* no y movement if p < 0 */
            p += c1;
        else {
            p += c2;
            bit += y_inc;
        }
        bit += x_inc;                   /* always increment x */
        sbit(bit_map, (int)bit);
    }
} /* end line_draw() */

/* end of "bres_line.c" */

APPENDIX B. IMPLEMENTATION IN ASSEMBLY LANGUAGE

# -- line_draw.s --
# _line_draw routine draws a line in a memory bit-map with dimensions
# DIM x DIM, from coordinates (xs,ys) to (xf,yf).
# Both coordinates are assumed to be inside the bit-map.
#
#   Register and memory use in loop       Parameters
#   -------------------------------       -----------
#   32(sp) = c1 constant                  24(sp) = xs
#   r2     = c2 constant                  28(sp) = ys
#   r7     = p decision variable          32(sp) = xf
#   r1     = current bit position         36(sp) = yf
#   r3     = last bit to be drawn
#   r5     = x-increment
#   24(sp) = y-increment
#   r0     = byte with set bit
#   r4     = bit set in byte
#





        .file   "line_draw.s"
        .set    DIM,2000            # change the 500,000 below if you change this
        .comm   _bit_map,500000     # 2000x2000 = 4,000,000 bits = 500,000 bytes
        .globl  _line_draw          # "export" _line_draw to the world

        .align 4
_line_draw:
# The following sequence replaces 'save [r3,r4,r5,r6,r7]'.
# Together with the replacement for 'restore' at the EXIT_SEQUENCE, we get
# a saving of 28 cycles per each call of the routine.
        movd    r3,tos
        movd    r4,tos
        movd    r5,tos
        movd    r6,tos
        movd    r7,tos              # end of 'enter' sequence

        movd    24(sp),r7           # load xs
        movd    28(sp),r2           # load ys
        movd    32(sp),r5           # load xf
        movd    36(sp),r0           # load yf



#
# Calculate first bit to plot = r1 = xs + DIM*ys
#
        movd    r2,r1
        muld    $(DIM),r1
        addd    r7,r1               # first bit to plot = r1 = xs + DIM*ys
#
# Calculate last bit to plot = DIM*yf + xf
#
        movd    r0,r3
        muld    $(DIM),r3
        addd    r5,r3               # r3 = last_bit = DIM*yf + xf

        subd    r2,r0               # r0 = dy = yf - ys
        absd    r0,r6               # r6 = abs(dy)
        movd    r5,r4
        subd    r7,r4               # r4 = dx = xf - xs
        absd    r4,r5               # r5 = abs(dx)



#
# Setting the values for the LOOP:
#
        cmpd    r6,r5               # is |dy| > |dx|?
        ble     abs_y_less
#
# |dy| > |dx|, i.e., slope > 1; rotate coordinate system
#
        movd    r6,36(sp)           # save dy, to test even/odd no. bits
#
# calculate the loop constants c1, c2, p
#
        addd    r5,r5               # c1 = 2*|dx|  Note: r5 is temporary for c1.
        movd    r5,r2               # r2 = c1
        subd    r6,r2               # r2 = c1-|dy|
        subd    r6,r2               # r2 = c1-2*|dy| = 2*|dx|-2*|dy| = c2
        movd    r5,r7               # r7 = 2*|dx|
        subd    r6,r7               # p = r7 = 2*|dx| - |dy|
        movd    r5,32(sp)           # 32(sp) is now c1
#
# calculate x-increment, y-increment
#
        cmpqd   $(0),r0             # is dy > 0?
        bge     negative_dy
        movd    $(DIM),r5           # x-increment = DIM
        br      get_y_increment
        .align 4
negative_dy:
        movd    $(-DIM),r5          # x-increment = -DIM
get_y_increment:
        cmpqd   $(0),r4             # is dx > 0?
        bge     negative_dx
        movqd   $(1),24(sp)         # y-increment = 1
        br      set_first_bit
        .align 4
negative_dx:
        movqd   $(-1),24(sp)        # y-increment = -1
        br      set_first_bit
        .align 4
#
# |dx| >= |dy|, i.e., slope <= 1; normal coordinate system
#
abs_y_less:
        movd    r5,36(sp)           # save dx, to test even/odd no. bits






#
# calculate the loop constants c1, c2, p
#
        addd    r6,r6               # r6 = c1 = 2*|dy|
        movd    r6,r2               # r2 = 2*|dy|
        subd    r5,r2               # r2 = r2 - |dx| = 2*|dy| - |dx|
        subd    r5,r2               # c2 = 2*|dy| - 2*|dx|
        movd    r6,r7               # r7 = 2*|dy|
        subd    r5,r7               # p = r7 = 2*|dy| - |dx|
        movd    r6,32(sp)           # save c1, free r6 for other use
#
# calculate x-increment, y-increment
#
        cmpqd   $(0),r4             # is dx > 0?
        bge     dx_negative
        movqd   $(1),r5             # x-increment = 1
        br      y_increment_get
        .align 4
dx_negative:
        movqd   $(-1),r5            # x-increment = -1
y_increment_get:
        cmpqd   $(0),r0             # is dy > 0?
        bge     dy_negative
        movd    $(DIM),24(sp)       # y-increment = DIM
        br      set_first_bit
dy_negative:
        movd    $(-DIM),24(sp)      # y-increment = -DIM
        .align 4
#
#       PLOTTING OF THE LINE
#

set_first_bit:
#       sbitd   r1,_bit_map         # plot first point by setting bit in bit_map
        movd    r1,r0
        lshd    $(-3),r0            # r0 = bit div 8 = byte where bit is set
        movd    r1,r4
        andd    $(7),r4             # r4 = bit mod 8 = which bit to set
        movqb   $(1),r6
        lshb    r4,r6               # r6 = 1 << (bit mod 8) = mask with set bit
        orb     r6,_bit_map(r0)
        cmpd    r1,r3               # is the first bit also the last bit?
        beq     EXIT_SEQUENCE
#
# Test if no. of bits is odd or even: if the difference (max(dx,dy))
# is odd, there is a total of an even no. of bits, and vice versa.
# The first bit has been plotted, so we must decide if to plot another
# bit before the loop (which plots two bits in each iteration) or not.
#
        tbitd   $(0),36(sp)
        bfc     MAIN_LOOP
        .align 4
#
# EXTRA BIT SET FOR EVEN NO. OF BITS
#
PRE_LOOP:
        cmpqd   $(0),r7             # is p < 0?
        bgt     p_negative
#
# if p >= 0 then
#
        addd    r2,r7               # p = p + c2
        addd    24(sp),r1           # bit = bit + y-increment
        addd    r5,r1               # bit = bit + x-increment
#       sbitd   r1,_bit_map         # plot by setting bit in bit_map
        movd    r1,r0
        lshd    $(-3),r0            # r0 = bit div 8
        movd    r1,r4
        andd    $(7),r4             # r4 = bit mod 8
        movqb   $(1),r6
        lshb    r4,r6               # r6 = 1 << (bit mod 8)
        orb     r6,_bit_map(r0)
        cmpd    r1,r3               # have we just plotted the last bit?
        bne     MAIN_LOOP
        br      EXIT_SEQUENCE
        .align 4
p_negative:
        addd    32(sp),r7           # p = p + c1
        addd    r5,r1               # bit = bit + x-increment
#       sbitd   r1,_bit_map         # plot by setting bit in bit_map
        movd    r1,r0
        lshd    $(-3),r0            # r0 = bit div 8
        movd    r1,r4
        andd    $(7),r4             # r4 = bit mod 8
        movqb   $(1),r6
        lshb    r4,r6               # r6 = 1 << (bit mod 8)
        orb     r6,_bit_map(r0)
        cmpd    r1,r3               # have we just plotted the last bit?
        beq     EXIT_SEQUENCE








#
# ================== MAIN LOOP ==================
#
MAIN_LOOP:
        cmpqd   $(0),r7             # is p < 0?
        bgt     p_negative2
#
# if p >= 0 then
#
        addd    r2,r7               # p = p + c2
        addd    24(sp),r1           # bit = bit + y-increment
        addd    r5,r1               # bit = bit + x-increment
#       sbitd   r1,_bit_map         # plot by setting bit in bit_map
        movd    r1,r0
        lshd    $(-3),r0            # r0 = bit div 8
        movd    r1,r4
        andd    $(7),r4             # r4 = bit mod 8
        movqb   $(1),r6
        lshb    r4,r6               # r6 = 1 << (bit mod 8)
        orb     r6,_bit_map(r0)
#
# unrolled repetition1
#
        cmpqd   $(0),r7             # is p < 0?
        bgt     p_negative1_2
#
# if p >= 0 then
#
        addd    r2,r7               # p = p + c2
        addd    24(sp),r1           # bit = bit + y-increment
        addd    r5,r1               # bit = bit + x-increment
#       sbitd   r1,_bit_map         # plot by setting bit in bit_map
        movd    r1,r0
        lshd    $(-3),r0            # r0 = bit div 8
        movd    r1,r4
        andd    $(7),r4             # r4 = bit mod 8
        movqb   $(1),r6
        lshb    r4,r6               # r6 = 1 << (bit mod 8)
        orb     r6,_bit_map(r0)
        cmpd    r1,r3               # have we just plotted the last bit?
        bne     MAIN_LOOP
        br      EXIT_SEQUENCE
        .align 4
p_negative1_2:
        addd    32(sp),r7           # p = p + c1
        addd    r5,r1               # bit = bit + x-increment
#       sbitd   r1,_bit_map         # plot by setting bit in bit_map
        movd    r1,r0
        lshd    $(-3),r0            # r0 = bit div 8
        movd    r1,r4
        andd    $(7),r4             # r4 = bit mod 8
        movqb   $(1),r6
        lshb    r4,r6               # r6 = 1 << (bit mod 8)
        orb     r6,_bit_map(r0)
        cmpd    r1,r3               # have we just plotted the last bit?
        bne     MAIN_LOOP
        br      EXIT_SEQUENCE
        .align 4
p_negative2:
        addd    32(sp),r7           # p = p + c1
        addd    r5,r1               # bit = bit + x-increment
#       sbitd   r1,_bit_map         # plot by setting bit in bit_map
        movd    r1,r0
        lshd    $(-3),r0            # r0 = bit div 8
        movd    r1,r4
        andd    $(7),r4             # r4 = bit mod 8
        movqb   $(1),r6
        lshb    r4,r6               # r6 = 1 << (bit mod 8)
        orb     r6,_bit_map(r0)
#
# unrolled repetition2
#
        cmpqd   $(0),r7             # is p < 0?
        bgt     p_negative2_2
#
# if p >= 0 then
#
        addd    r2,r7               # p = p + c2
        addd    24(sp),r1           # bit = bit + y-increment
        addd    r5,r1               # bit = bit + x-increment
#       sbitd   r1,_bit_map         # plot by setting bit in bit_map
        movd    r1,r0
        lshd    $(-3),r0            # r0 = bit div 8
        movd    r1,r4
        andd    $(7),r4             # r4 = bit mod 8
        movqb   $(1),r6
        lshb    r4,r6               # r6 = 1 << (bit mod 8)
        orb     r6,_bit_map(r0)
        cmpd    r1,r3               # has the last bit been plotted?
        bne     MAIN_LOOP
        br      EXIT_SEQUENCE
        .align 4
p_negative2_2:
        addd    32(sp),r7           # p = p + c1
        addd    r5,r1               # bit = bit + x-increment
#       sbitd   r1,_bit_map         # plot by setting bit in bit_map
        movd    r1,r0
        lshd    $(-3),r0            # r0 = bit div 8
        movd    r1,r4
        andd    $(7),r4             # r4 = bit mod 8
        movqb   $(1),r6
        lshb    r4,r6               # r6 = 1 << (bit mod 8)
        orb     r6,_bit_map(r0)
        cmpd    r1,r3               # has the last bit been plotted?
        bne     MAIN_LOOP



END OF MAIN LOOP 



EXIT_SEQUENCE:
                                        # replace the 'restore [r3, ..., r7]' instruction:
        movd    tos,r7
        movd    tos,r6
        movd    tos,r5
        movd    tos,r4
        movd    tos,r3
        ret
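
For readers more comfortable with C, the decision loop and the byte-wise
plotting sequence of the listing can be summarized by the minimal sketch
below. It is not the application note's code. The function names plot_bit
and draw_bits, the parameter names, and the caller-supplied byte array are
assumptions made for illustration. The sketch assumes, as the listing's
comments suggest, that "bit" is a linear bit index into the bit-map, and
unlike the assembly routine it plots the first pixel itself and is not
unrolled.

```c
/* Set one pixel. Mirrors the movd/lshd/andd/movqb/lshb/orb sequence
 * that the listing uses in place of the commented-out sbitd. */
static void plot_bit(unsigned char *bit_map, long bit)
{
    long byte_index = bit >> 3;                             /* bit div 8 */
    unsigned char mask = (unsigned char)(1u << (bit & 7));  /* 1 << (bit mod 8) */
    bit_map[byte_index] |= mask;
}

/* Decision loop, written without unrolling.
 * p, c1, c2   : decision variable and its two increments (names taken
 *               from the listing's comments)
 * x_inc, y_inc: bit-index increments for one step along x and y
 * bit must be a non-negative index and last_bit must be reachable. */
static void draw_bits(unsigned char *bit_map, long bit, long last_bit,
                      long p, long c1, long c2, long x_inc, long y_inc)
{
    plot_bit(bit_map, bit);             /* first pixel of the line */
    while (bit != last_bit) {
        if (p < 0) {
            p += c1;                    /* p = p + c1 */
            bit += x_inc;               /* step along x only */
        } else {
            p += c2;                    /* p = p + c2 */
            bit += y_inc;               /* step along y ... */
            bit += x_inc;               /* ... and along x */
        }
        plot_bit(bit_map, bit);
    }
}
```

As a usage example under the standard Bresenham setup (assumed here, not
quoted from the note): on a bit-map 16 pixels wide, the line from (0,0) to
(5,2) has dx = 5 and dy = 2, so p = 2*dy - dx = -1, c1 = 2*dy = 4,
c2 = 2*(dy - dx) = -6, x_inc = 1, y_inc = 16 and last_bit = 2*16 + 5 = 37.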
APPENDIX C. TIMING PERFORMANCE OF THE NS32GX32 MICROPROCESSOR

Timing was measured on the Star-Burst image of Figure 7.
The driving "main" program in C (given below) was used to
call the assembly "line_draw" routine. For greater accuracy,
the time was measured for 100 iterations of the Star-Burst
image and divided by 100 to give the results in Figure 8.
The file "bres.h" is in Appendix A.

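#include "bres.h"

/*
 * main()
 * Generates the Star-Burst image
 */
int main(void)
{
    int i, count;

    /* Draw the whole image 100 times to get a measurable total time. */
    for (count = 1; count <= 100; count++) {
        /* Lines running from x = 0 to x = maxx */
        for (i = 0; i <= maxy; i += 25)
            line_draw(0, i, maxx, maxy - i);
        /* Lines running from y = maxy to y = 0 */
        for (i = 0; i <= maxx; i += 25)
            line_draw(i, maxy, maxx - i, 0);
    }
    return 0;
}

/* end of "main.c" */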
Star-Burst Benchmark: This Star-Burst image was drawn on a 2000 x 2000 pixel bit-map. Each line is 2000 pixels in length and
passes through the center of the image, bisecting the square.

The lines are 25 pixels apart, and were drawn using the "line_draw.s" routine. There is a total of 160 lines.

The total drawing time for this image was 0.44 seconds on a 25 MHz NS32GX32.

FIGURE 7. Graphics Image (2000 x 2000 Pixels), 300 DPI
Parameter                          Time on 25 MHz NS32GX32
---------------------------------  -----------------------
Setup Time per Line                4.82 µs
Lines per Second                   364.3
Pixels per Second                  728,531
Total Time of Star-Burst Image     439.5 ms

FIGURE 8. Timing Performance for the Star-Burst
Image of Figure 7. The Whole Image Consists
of 160 Lines of 2000 Pixels Each.

The setup time was measured from the start of the
"line_draw" routine up to the setting of the first bit. The
overhead of the "main" routine and of calling the
"line_draw" routine is not included.

The lines-per-second figure includes only the net time in the
"line_draw" routine itself, including setup time. It was
calculated as follows: the overhead of the "main" driving
routine, including its call of "line_draw", was subtracted
from the time measured for the whole image. The difference
was divided by 100 (the number of iterations of the whole
image) and then by 160 (the number of lines per image).
This gives the time for one line. The reciprocal of this time is
the number of lines drawn per second.

The pixels-per-second figure is the lines-per-second figure
multiplied by 2000 (the number of pixels per line).

The total time for the image includes the overhead of the
"main" routine.

Note: The total time for the C version of the "line_draw" routine (Appendix
A), as compiled by the GNX version 3 C optimizing compiler, was
850 ms.
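
The arithmetic described above can be written out as the following minimal
sketch. The function names are hypothetical, and the overhead of the "main"
routine is left as a parameter because its measured value is not stated in
this note.

```c
/* Net time spent drawing one line, in seconds.
 * total_s    : time measured for all iterations of the whole image
 * overhead_s : overhead of the driving routine over the same run
 *              (not given in the note; supplied by the caller) */
static double time_per_line(double total_s, double overhead_s,
                            int iterations, int lines_per_image)
{
    return (total_s - overhead_s) / iterations / lines_per_image;
}

/* Lines per second is the reciprocal of the time for one line. */
static double lines_per_second(double total_s, double overhead_s,
                               int iterations, int lines_per_image)
{
    return 1.0 / time_per_line(total_s, overhead_s,
                               iterations, lines_per_image);
}

/* Pixels per second is lines per second times pixels per line. */
static double pixels_per_second(double lines_per_sec, int pixels_per_line)
{
    return lines_per_sec * pixels_per_line;
}
```

Multiplying the rounded 364.3 lines per second by 2000 gives 728,600, so
the 728,531 pixels per second in Figure 8 was evidently computed from the
unrounded rate of about 364.27 lines per second.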

APPENDIX D. REPLACING THE "MUL" INSTRUCTION

As mentioned in Section 4.1.2, two multiplications are needed
in the setup calculations before the algorithm's main
loop. A "mul" instruction takes 37 cycles to execute, much
more than the average of 3.5 cycles per instruction.
In this appendix, we show a specific example, suitable for a
bit-map with an x-dimension of 2000 pixels. We represent
2000 as 16 * (128 - 3). "muld $(2000),r0" is replaced by
the following sequence, saving about 24 cycles per
multiplication (a 3X speedup):



        movd    r0,r3
        lshd    $(7),r3         # r3 = 128 * r0
        movd    r0,r6
        addd    r6,r6           # r6 = r0 + r0 = 2 * r0
        addd    r0,r6           # r6 = r6 + r0 = 3 * r0
        subd    r6,r3           # r3 = r3 - r6 = (128 - 3) * r0 = 125 * r0
        lshd    $(4),r3         # r3 = 16 * r3 = 16 * 125 * r0 = 2000 * r0
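
The same decomposition can be checked in C. The sketch below is only an
illustration of the shift-and-add idea, with a hypothetical function name
mul2000. The local variables are named after the registers in the sequence
above, and unsigned 32-bit arithmetic is assumed.

```c
#include <stdint.h>

/* Multiply by 2000 using 2000 = 16 * (128 - 3), with shifts, adds and
 * one subtract only. Results wrap modulo 2^32, like a 32-bit register. */
static uint32_t mul2000(uint32_t r0)
{
    uint32_t r3 = r0 << 7;      /* r3 = 128 * r0 */
    uint32_t r6 = r0 + r0;      /* r6 = 2 * r0 */
    r6 += r0;                   /* r6 = 3 * r0 */
    r3 -= r6;                   /* r3 = 125 * r0 */
    r3 <<= 4;                   /* r3 = 2000 * r0 */
    return r3;
}
```

For example, mul2000(1999) yields 3998000, the bit offset of the start of
row 1999 in a bit-map that is 2000 pixels wide.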
APPENDIX E. OPTIMIZING BRANCH INSTRUCTIONS

One of the features of the NS32GX32 that enables it to 
achieve its high performance is the incorporation of a pipe- 
lined instruction processor. 

The flow of instructions through the pipeline is delayed 
when the address from which to fetch an instruction de- 
pends on a previous instruction, such as when a conditional 
branch is executed. The loader includes special circuitry to 
handle branch instructions, which calculates the destination 
address and selects between the sequential and non-se- 
quential streams. 

An incorrect prediction of the branch causes a breakage in 
the instruction pipeline, resulting in a delay of 4 cycles. 
For conditional branches the branch is predicted taken if it is 
backward or if the tested condition is NE or LE. A branch 
predicted incorrectly, whether taken or not, causes the de- 
lay of 4 cycles. On the other hand, a branch predicted cor- 
rectly causes a delay of 1 cycle if it is taken, and no delay if 
not taken. Thus in the main loop we use 

        cmpqd   $(0),r7         # is p < 0?
        bgt     p_negative2

rather than

        cmpqd   $(0),r7         # is p < 0?
        ble     p_non_negative2

On the average, half of the branches are taken. For the half 
incorrectly predicted there is no difference between the two. 
With ble, however, the prediction is "branch taken", with a 
delay of 1 cycle when correct, while with bgt the prediction 
is "not taken", with no delay when correct. This gives an 
additional 2.4% improvement in loop time. 
Similarly, we use 

        bne     MAIN_LOOP
        br      EXIT_SEQUENCE

near the end of the loop, rather than

        beq     EXIT_SEQUENCE
        br      MAIN_LOOP

because the second sequence would result mostly in a 
wrong prediction. 
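
The bgt versus ble comparison can be reproduced with a small cost model.
The sketch below is a simplification written for this appendix, not a
description of the hardware. It encodes only the prediction rule and the
delays stated above (4 cycles when mispredicted, 1 cycle when correctly
predicted taken, none when correctly predicted not taken). The names and
the enum are hypothetical.

```c
/* Branch conditions, reduced to the cases the prediction rule names. */
enum cond { COND_NE, COND_LE, COND_OTHER };

/* Predicted taken if backward, or if the condition is NE or LE. */
static int predicted_taken(int is_backward, enum cond c)
{
    return is_backward || c == COND_NE || c == COND_LE;
}

/* Delay in cycles for one execution of a conditional branch. */
static int branch_delay(int predicted, int taken)
{
    if (predicted != taken)
        return 4;               /* wrong prediction */
    return taken ? 1 : 0;       /* correct prediction */
}

/* Average delay when the branch is taken with probability p_taken. */
static double expected_delay(int is_backward, enum cond c, double p_taken)
{
    int pred = predicted_taken(is_backward, c);
    return p_taken * branch_delay(pred, 1)
         + (1.0 - p_taken) * branch_delay(pred, 0);
}
```

With half of the branches taken, this model gives an average delay of
2.0 cycles for the forward bgt and 2.5 cycles for the forward ble, that is,
half a cycle saved per decision by using bgt.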
LIFE SUPPORT POLICY

NATIONAL'S PRODUCTS ARE NOT AUTHORIZED FOR USE AS CRITICAL COMPONENTS IN LIFE SUPPORT 
DEVICES OR SYSTEMS WITHOUT THE EXPRESS WRITTEN APPROVAL OF THE PRESIDENT OF NATIONAL 
SEMICONDUCTOR CORPORATION. As used herein: 



1. Life support devices or systems are devices or
   systems which, (a) are intended for surgical implant
   into the body, or (b) support or sustain life, and whose
   failure to perform, when properly used in accordance
   with instructions for use provided in the labeling, can
   be reasonably expected to result in a significant injury
   to the user.

2. A critical component is any component of a life
   support device or system whose failure to perform can
   be reasonably expected to cause the failure of the life
   support device or system, or to affect its safety or
   effectiveness.